Influence of founder number on haploRILs performance
# Create individual plotsplot1 = avr_metrics %>% dplyr::group_by(GE_rate, n_founders) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(n_founders), y = prc, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="n_founders", y ="Precision (prc)", fill ="GE Rate") +theme_minimal()plot2 = avr_metrics %>% dplyr::group_by(GE_rate, n_founders) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(n_founders), y = rcl, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="n_founders", y ="Recall (rcl)", fill ="GE Rate") +theme_minimal()print(plot1)
print(plot2)
Influence of individual parameters on haploRILs performance (16 founders)
# Create individual plotsplot_K_prc = avr_metrics %>% dplyr::filter(n_founders ==16) %>% dplyr::group_by(GE_rate, K) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(K), y = prc, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="K", y ="Precision (prc)", fill ="GE Rate") +theme_minimal()plot_nSnp_prc = avr_metrics %>% dplyr::filter(n_founders ==16) %>% dplyr::group_by(GE_rate, nSnp) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(nSnp), y = prc, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="nSnp", y ="Precision (prc)", fill ="GE Rate") +theme_minimal()plot_step_prc = avr_metrics %>% dplyr::filter(n_founders ==16) %>% dplyr::group_by(GE_rate, step) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(step), y = prc, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="step", y ="Precision (prc)", fill ="GE Rate") +theme_minimal()plot_K_rcl = avr_metrics %>% dplyr::filter(n_founders ==16) %>% dplyr::group_by(GE_rate, K) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(K), y = rcl, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="K", y ="Recall (rcl)", fill ="GE Rate") +theme_minimal()plot_nSnp_rcl = avr_metrics %>% dplyr::filter(n_founders ==16) %>% dplyr::group_by(GE_rate, nSnp) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(nSnp), y = rcl, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="nSnp", y ="Recall (rcl)", fill ="GE Rate") +theme_minimal()plot_step_rcl = avr_metrics %>% dplyr::filter(n_founders ==16) %>% dplyr::group_by(GE_rate, step) %>%summarise(prc =mean(prc), rcl =mean(rcl), .groups ="drop") %>%ungroup() %>%ggplot(aes(x =factor(step), y = rcl, fill =factor(GE_rate))) +geom_bar(stat ="identity", position ="dodge") +labs(x ="step", y ="Recall (rcl)", fill ="GE Rate") +theme_minimal()
# Combine the plots into a 3x2 grid (adjust to 3x6 as needed)# combined_plot = (plot_K_prc | plot_nSnp_prc | plot_step_prc) / (plot_K_rcl | plot_nSnp_rcl | plot_step_rcl)# Display the combined plotprint(plot_K_prc)
print(plot_nSnp_prc)
print(plot_step_prc)
print(plot_K_rcl)
print(plot_nSnp_rcl)
print(plot_step_rcl)
Inspecting best haploRILs parameters
Based on the dotplots and barplots showing combined and individual effects of parameters on haploRILs performance respectively, it seems that the best combination of haploRILs parameters are:
nFounders = 16
haploRILs performed clearly better in terms of precision and recall when the actual set of founders was exclusively used. Previous experience with 100 founders showed boosted crossover numbers, which matches the large numbers of detected crossovers under genotyping errors and probably explains the recall gains under genotyping errors.
nSnp = 12
12-SNP windows produced consistently better average precision values than bigger windows. Despite recall not being so good at 0% GE rate (since the high recalls at higher rates are associated with numerous false positives), the decision on the parameter winners should be based on precision. Thats because the ML model objective would be to classify crossovers from other loci, so the input to the model must be as accurate as possible in order to detect features associated with recombination frequency. In other words, safe crossovers are better than non-safe crossovers.
K = 3
At GE rate 0% the influence of K is negligible, but under realistic scenarios, higher K values improve performance sustantially. I would say that higher K values could be contemplated depending on the CO numbers obtained.
step = 1
The new parameter did not produce the results expected, since steps causing window overlap (step = 2) performed consistently worse than the non-overlapping (step = 1). However, the effect could be positive in some scenarios. More overlap also has a sustantial effect on resolution.
winning_metrics_long = winning_metrics %>%pivot_longer(cols =c(prc, rcl), names_to ="Metric", values_to ="Value")# Create the barplotggplot(winning_metrics_long, aes(x = GE_rate, y = Value, fill = Metric)) +geom_bar(stat ="identity", position ="dodge") +labs(x ="GE_rate", y ="Value", title ="Precision and Recall by GE rate") +theme_minimal() +scale_fill_manual(values =c("prc"="blue", "rcl"="red"))
avr_metrics_long = alt_metrics %>%pivot_longer(cols =c(prc, rcl), names_to ="Metric", values_to ="Value")# Create the barplotggplot(avr_metrics_long, aes(x = GE_rate, y = Value, fill = Metric)) +geom_bar(stat ="identity", position ="dodge") +labs(x ="GE_rate", y ="Value", title ="Precision and Recall by GE rate") +theme_minimal() +scale_fill_manual(values =c("prc"="blue", "rcl"="red"))
# K = 2altParameters =c(16, 12, 1, 2)names(altParameters) =c("n_founders", "nSnp", "step", "K")alt_metrics = avr_metrics %>% dplyr::filter( n_founders == altParameters["n_founders"], nSnp == altParameters["nSnp"], step == altParameters["step"], K == altParameters["K"] )dt_metrics = alt_metrics %>% dplyr::mutate(dplyr::across(c(co_detected, co_simulated, prc, rcl, f1s, res), ~round(., 2)))# Display the interactive tabledatatable(dt_metrics, options =list(pageLength =5, # Number of rows per pageautoWidth =TRUE,scrollX =TRUE,rownames =FALSE))
avr_metrics_long = alt_metrics %>%pivot_longer(cols =c(prc, rcl), names_to ="Metric", values_to ="Value")# Create the barplotggplot(avr_metrics_long, aes(x = GE_rate, y = Value, fill = Metric)) +geom_bar(stat ="identity", position ="dodge") +labs(x ="GE_rate", y ="Value", title ="Precision and Recall by GE rate") +theme_minimal() +scale_fill_manual(values =c("prc"="blue", "rcl"="red"))